%Introduction

Modern processor designers have sufficient transistor
resources to include an increasing number of increasingly fast
cores in a processor design, but lack the power budgets to
scale aggressively in both per-core performance and number of cores
simultaneously~\cite{conservation-cores,isca11-darksilicon}. Although peak performance may
still scale~\cite{ieeemicro-sprinting}, sustainable performance is
increasingly constrained. Notably, for any given application, there are one or
more optimal throughput points in the space of per-processor
performance and the number of such processors given to that
application.  However, due to thermal, yield, and other constraints,
not all of these optimal points correspond to currently realizable
systems. Similarly, for metrics other than performance, such as
average power, total energy, or silicon area, the optimal points may be
quite diverse.

While Moore's law continues to advance for the time being, promising
access to progressively more parallel systems, limitations on the
balance between leakage and dynamic power, wires and logic, and
complexity and power density fundamentally couple the ability to
realize superior design points in current and future process
generations with the energy efficiency of the processors
employed. With the end of Dennard scaling~\cite{Dennard1974}, there is
increased interest in techniques~\cite{NTC-UMich} and
technologies~\cite{steepslope,mookerjea,ionescu-nems} that promise fundamentally more
energy-efficient computation. However, many of these CMOS alternatives offer
their benefits in energy per instruction at the cost of greatly reduced
performance. NEMS relays, for example, have effectively zero leakage
power, but can only operate at frequencies orders of magnitude lower
than those of CMOS gates.

In this paper, we focus on the impact that emerging devices and
techniques will have on shifting which regions of the design space
correspond to systems that are both plausible for mass deployment
(i.e., they can be produced with meaningful yield and operate at
commonly accepted peak temperatures) and preferable (e.g., among
multiple implementations that achieve the same performance, we
prefer the more efficient or the cheaper design). In particular, we focus on the rapidly maturing
technique of 3D integration~\cite{ionescu-3D,yibo-yield-iccad} and the potential
benefits offered by designs built with \emph{Interband
  Heterojunction Tunnel Field Effect Transistors} (\emph{TFETs})~\cite{mookerjea,seabaugh,dac11}.

Both 3D integration and TFET designs offer the potential to extend the
maximum number of aggressive cores possible within a viable yield and
thermal budget. Yield decreases superlinearly with increases in
area~\cite{yibo-yield-iccad}, and communication costs among cores scale
poorly in planar designs~\cite{reetu-3d-cost}. Thus, 3D integration
offers a very direct means to achieve meaningfully higher core counts
in tightly integrated systems. However, moving to a 3D design
aggravates thermal limitations by placing both
additional heat sources and insulators between the cooling system and
lower layers in the stack. On the other hand, TFETs and other
steep-slope devices offer fundamental reductions in leakage currents
and switching energy at the cost of a more limited upper range of
operating frequencies. There is a natural synergy between 3D
integration and TFETs: reducing the thermal density on each
layer by substituting TFET designs for CMOS allows more layers
within the thermal budget, which in turn allows 3D TFET-based designs
to scale to sufficient parallelism to overcome the limited serial
performance of TFET-based processors. However, while deploying
TFET-based designs in a 3D architecture is conceptually appealing, many
questions regarding how best to design such a multiprocessor
(e.g., microarchitecture selection, performance targeting, scaling)
have not been definitively answered.

The contributions of this paper are as follows:
\begin{itemize}
\item We conduct an extensive evaluation of the performance and
  energy tradeoffs among choices in device technologies,
  3D integration, microarchitecture, and scheduling under
  constraints imposed by realistic yield models, thermal bounds, and
  exploitable application parallelism.
\item We show that, with 3D integration, steep-slope devices are
  already plausible candidates for achieving peak performance for
  highly parallel applications.
\item We show how, with further technology scaling, the range of
  applications for which steep-slope devices are appropriate grows,
  while the portion of the design space where CMOS is optimal shrinks
  to a point where only a small number of CMOS cores may be desirable.
\item We present an intelligent scheduling approach for hybrid
  CMOS-TFET systems that allows less parallel applications to still
  achieve a significant fraction of their peak performance on a
  primarily TFET-based system.
\end{itemize} 

The remainder of the paper proceeds as
follows. Section~\ref{sec:motivation} motivates the opportunities for
new technologies to expand the realm of viable latency and parallelism
tradeoffs. Section~\ref{sec:background} provides background information on the
properties of TFETs and the modeling of TFET-based designs.
Section~\ref{sec:technique} describes our approach to exploring
the design space, Section~\ref{sec:methodology} details our
methodology, and Section~\ref{sec:results} presents the results of our
investigations. Section~\ref{sec:related} reviews related work and
Section~\ref{sec:conclusion} concludes.

% LocalWords:  Dennard CMOS NEMS 3D tradeoffs microarchitecture 3D FET TFETs
% LocalWords:  TFET
